-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Support kNN filter on nested metadata #113949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support kNN filter on nested metadata #113949
Conversation
Current knn search over nested vectors only supports filtering on parent's meatadata. This adds support for filtering over nested metadata. Closes elastic#106994
Documentation preview: |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
Hi @mayya-sharipova, I've created a changelog YAML for you. |
// If filter matches non-nested docs, we assume this is a filter over parents docs, | ||
// so we will modify it accordingly: matching parents docs with join to its child docs | ||
if (nestedHelper.mightMatchNonNestedDocs(filterQuery, parentPath)) { | ||
// Ensure that the query only returns parent documents matching `filterQuery` | ||
filterQuery = Queries.filtered(filterQuery, parentFilter); | ||
filterQuery = new ToChildBlockJoinQuery(filterQuery, parentBitSet); | ||
} | ||
// Now join the filterQuery & parentFilter to provide the matching blocks of children | ||
filterQuery = new ToChildBlockJoinQuery(filterQuery, parentBitSet); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work when filtering over both a nested & top level metadata?
Something like:
{ "bool": { "filter" : [{"nested" : {"path": ..., {...}}}, {"term": {"top-level-thing": "foo"}}]}}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we run a query like this where this a filter on both parent and nested metadata, it will NOT work, and return 0 results:
{
"query": {
"nested": {
"path": "paragraphs",
"query": {
"knn": {
"field": "paragraphs.vector",
"query_vector": [
1,
1,
1
],
"k": 2,
"filter": [
{
"match": {
"paragraphs.language": "FR"
}
},
{
"match": {
"title": "title2"
}
}
]
}
},
"inner_hits": {
"size": 2
}
}
}
}
I can not think of anyway to support both filters like this. I suggest we can document this as a limitation, and advice to use a filter for parent's metadata as a post-filter. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its weird to me that this doesn't work:
POST test/_search
{
"query": {
"nested": {
"path": "nested",
"query": {
"knn": {
"field": "nested.vector",
"query_vector": [
1,
1,
1,
1,
1
],
"k": 2,
"filter": [
{
"nested": {
"path": "nested",
"query": {
"match": {
"nested.language": "EN"
}
}
}
}
]
}
},
"inner_hits": {
"size": 2
}
}
}
}
It seems to me that filter
should have a "top level view" of the docs and allow you to have a separate query that gets executed that just works and joins to the appropriate knn level. I am not sure if that is possible given the current API. If its truly a hard restriction, we should indicate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent Sorry, I got confused with your last comment.
Why is it weird for you that the query you provided above doesn't work? I think it is mal-formed query, we already provided "nested" context on top, so knn filter should not contain "nested". That's how other nested queries work: inside nested
query, we query only on nested metadata, and to my mind , it was wrong to allow filter on parents' metadata.
It seems to me that filter should have a "top level view" of the docs and allow you to have a separate query that gets executed that just works and joins to the appropriate knn level.
Can you please clarify what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova I need to think a bit more. Its difficult to fully resolve the user experience between the top-level kNN object (which has NO signal that its nested) and a query which obviously has a nested context as its inside a nested query.
This & simplicity was why I thought (and mostly still do), think filtering on top-level metadata by default is most useful.
I am not even sure how script_score
actually works in these various scenarios.
Originally, Jim proposed a new query type that would allow a nested
context to join back to the parent docs and then provide a parent level filter (reverse_join_parent type thing). Maybe that is the only way we can do this.
But I do not think we should ship this without the ability to filter on both (parent & nested) at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is that old POC of jim's: main...jimczi:elasticsearch:nested_dense_vector
I honestly am not sure how to square this well.
But, I do think one of the following is going to be necessary:
- The filter for kNN's need to be treated like a top-level query, meaning it completely ignores its nested context and allows for a completely separate set filters. This is admittedly difficult (maybe impossible with how we do parent ID joining while search the graph).
- We detect that somebody is doing nested search and try to rewrite it so that it works :/ (yeah, I am not sure this is actually possible, again)
- We keep the current behavior, but add a
reverse_nested
query so that folks can do a complex combination of nested & higher level filters. I am unsure if this is possible while maintaining current BWC.
These are some ideas, I realize that they are various levels of impossible/difficult. We can talk synchronously to brain storm further if you want.
Each individual filter should be either on parent or nested.
@mayya-sharipova Yes, KNN query still feels weird in many cases. I understand that there are many reasons for that, and that there is a desire to find a better, more intuitive API. Yet, looking at this MR, it looks like it works, improves on some limitations of current building blocks, and potentially solves some real problems your users experience now. I work on enterprise search, and we want to filter on the parent document level as well as fields of nested paragraphs, holding things like detected entities or sentiment. And from what I see (looking at the test), this change solves exactly this problem. I understand that a more complex filtering logic, like mixing parent and nested child filters (e.g., wrapped in a Boolean query and placed inside the I'm pretty sure you'll find a better way of expressing such queries one day. But to me, it looks like enough time was spent, and I don't think you should wait any longer here. |
@mayya-sharipova, with this change, are we focussing on queries based on same level data? |
...-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/100_knn_nested_search.yml
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java
Outdated
Show resolved
Hide resolved
// filter can be both over parents or nested docs, so add them as should clauses to a filter | ||
BoolQueryBuilder adjustedFilter = new BoolQueryBuilder().should(filter) | ||
.should(new ToChildBlockJoinQueryBuilder(filter)); | ||
boolQuery.filter(adjustedFilter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lucene doesn't break here?
I guess we just do the right thing? I know Lucene will complain if we attempt to query things that are or are not child docs vs. parent docs. etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, Lucene ToChildBlockJoinQuery
doesn't allow a filter that can also match child docs. But in our ToChildBlockJoinQueryBuilder::doToQuery, I've added extra parentFilter to ensure that only parents docs can match:
// ensure that parentQuery only applies to parent docs by adding parentFilter
return new ToChildBlockJoinQuery(Queries.filtered(parentQuery, parentFilter), parentBitSet);
|
||
|
||
--- | ||
"Test filter on nested fields with filter on both nested and parent metadata is not allowed": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a test/scenario where we are doing a filter on another nested field?
so, assume we have
{
"nested_1": {"properties": {"vector"....}},
"nested_2": {"properties": {"product_name":....}}
}
Then the filter would be:
filter: [{ nested: {path: nested_2, query: {match: { nested_2.product_name: "FR" }}]
or something (unsure if I got all the nesting right...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, I think our "auto-nested check" should only apply in the case of the filter matching the vectors nested context. Any other filter should be applied as a "top level" filter and return the relevant parent doc ids (I am unsure if this is exactly the case, but it seems to be right now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, replicate such a test in the knn query tests if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up.
In acbfccd I've added a test for nested sibling fields. Currently:
- For nested knn query, we can NOT filter on nested sibling fields, as the query DSL itself doesn't allow it.
- For top level knn search, we can filter on nested sibling fields, and effectively addressing the 128803 issue.
Are you not happy with this behaviour, are you saying we should NOT allow filter on sibling nested fields in nested knn search and throw an exception, which would mean we should close 128803
as not going to implement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For nested knn query, we can NOT filter on nested sibling fields, as the query DSL itself doesn't allow it.
This is weird. A separate nested field should be able to join up to the parent level, indicate the parents that are filtered and we join back down to the child level to filter on the vectors. I don't know how to express that in the DSL, but we should document it as a limitation.
server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java
Outdated
Show resolved
Hide resolved
…ada field" As it introduces a breaking change This reverts commit e3b0251.
nested2.value: "domestic" | ||
inner_hits: { size: 3, "fields": [ "nested.paragraph_id", "nested.language"], _source: false } | ||
|
||
- match: { hits.total.value: 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent Notice how for nested query the DSL itself doesn't allow to filter on nested sibling fields, as the top level query:nested:path sets the parent context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova thank you.
I was thinking it would be possible via boolean
with two separate nested contexts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They key issue is that with kNN search, we need to be clear that its within a SINGLE nested context (either top level, or its own context).
That gets tricky, and maybe we can make it better in the future. But the current compromise unblocks some folks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can not think of any way to support pre-filtering on nested sibling fields in knn query. We can do the following, but it would be post-filter:
- do:
search:
index: test
body:
_source: false
query:
bool:
must:
- nested:
path: nested
query:
knn:
field: nested.vector
query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
k: 10
filter:
- nested:
path: nested2
query:
bool:
filter:
- match:
nested2.key: "category"
- match:
nested2.value: "domestic"
nested2.key: "category" | ||
- match: | ||
nested2.value: "domestic" | ||
- match: { hits.total.value: 2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here with the current implementation we can query nested sibling fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was sort of hoping that we could do something similar in the nested query since we look to the parent level. But it gets too complicated given the nested context?
@benwtrent Thanks for your latest feedback, but I am not sure what is the path forward:
|
I think this is OK. In the query DSL, we declare the KNN query to be particularly within a given nested context. For the top level it makes me wonder WHY it works...it likely has to do with the nested context... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current limitations are understandable. We just need to be clear in documentation and testing whats actually doable.
Thank you for iterating on this.
nested2.key: "category" | ||
- match: | ||
nested2.value: "domestic" | ||
- match: { hits.total.value: 2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was sort of hoping that we could do something similar in the nested query since we look to the parent level. But it gets too complicated given the nested context?
Hi @mayya-sharipova, I've updated the changelog YAML for you. |
@elasticsearchmachine run "elasticsearch-ci/8.17.10 / Part 2 / bwc-snapshots" |
@elasticsearchmachine run elasticsearch-ci/bwc-snapshots-part2 |
This is to update our documentation that we now can support knn filter on nested metadata. Related to elastic/elasticsearch#113949
This is to update our documentation that we now can support knn filter on nested metadata. Related to elastic/elasticsearch#113949 --------- Co-authored-by: Liam Thompson <[email protected]>
Current knn search over nested vectors only supports filtering on parent's metadata.
This adds support for filtering over nested metadata.
Closes #106994
Closes #128803